Abstract

Hierarchies are frequently used for the
organization of objects. Given a hierarchy of
classes, two main approaches are used to
automatically classify new instances: flat
classification and cascade classification. Flat
classification ignores the hierarchy, while
cascade classification greedily traverses the
hierarchy from the root to the predicted leaf. In
this paper we propose a new approach, which
extends cascade classification to predict the
right leaf by estimating the probability of each
root-to-leaf path. We provide experimental
results which indicate that, using the same
classification algorithm, one can achieve better
results with our approach, compared to the
traditional flat and cascade classifications.

1 Introduction 

Machine learning is often used to estimate
classification models for a set of predefined
categories. Most of the time, these categories
are assumed to be independent. When independence
cannot be assumed, we may either construct
artificial hierarchies (hierarchical clustering),
or classify new instances onto a hierarchy that
is given, typically representing is-a relations.

In this paper we study cases where the hierarchy
is already provided. Furthermore, the hierarchy
is a tree and the classification nodes are always
the leaves of the hierarchy. We also assume that
each instance belongs to only one category
(single-label classification).

Many researchers approach hierarchical
classification problems using flat
classification, i.e. ignoring the hierarchy.

Hierarchical classification approaches typically
divide the problem into smaller ones, usually one
classification problem for each node of the
hierarchy. For each of these problems fewer
features and instances are required to train a
good classifier, compared to the respective flat
approaches. This can be very important,
especially in cases of large scale
classification, where the number of categories
and instances can increase to thousands and
millions respectively. In such cases, a
hierarchical approach would require far fewer
resources than a flat one.

The main issue in hierarchical classification is
to combine the decisions of the node-specific
classifiers appropriately, in order to predict a
category for an instance. The most common
approach is that of cascade classification. In
this case, we start at the root of the hierarchy
and greedily select the most probable descendant.
This continues until we reach a leaf, which is
chosen as the predicted node. The main
disadvantage of this approach is that any mistake
made during the descent deterministically leads
to the wrong final decision. Therefore the
cascade is very sensitive to the quality of the
inner node classifiers. In this paper we propose
a new approach, which is as fast as cascade
regarding training but leads to better results
compared to cascade and flat classification,
using the same classification algorithms.

In the next section we present the related work,
while in Section 3 we introduce our approach.
Section 4 discusses our experimental results.
Finally, Section 5 concludes and points to future
work.

2 Related Work 

Although hierarchical classification has many
advantages, researchers typically resort to
mildly hierarchical or even flat approaches [3].
One reason for this is that flat classification
is well studied, so it is easier to transfer
methods from this field. On the other hand, on
large scale problems the flat use of traditional
classifiers, such as SVMs, is often prohibitively
expensive computationally [3].

Early work in hierarchical classification focused
on approaches such as shrinkage [5] and
hierarchical mixture models [7]. Unfortunately,
most of these approaches cannot be applied to
large scale problems, at least in the form
described in the original papers. New methods
based on similar ideas, such as that of latent
concepts [8], continue to appear in the
literature, also taking scalability issues into
account. Still, most of the proposed methods are
tested on rather small datasets with small
hierarchies.

Mildly hierarchical approaches typically make
limited use of the hierarchy. Methods such as [9]
use only some levels of the hierarchy, flattening
the rest. Other approaches, such as [1], alter
the initial hierarchy before performing
cascading, in order to minimize errors at the
upper levels of the hierarchy.

3 Probabilistic Cascading 

In our method, following the cascading approach,
we train one binary classifier for each node of
the hierarchy. For example, using the hierarchy
of Figure 1 we would train one classifier for
each of the nodes Arts, Health, Music, Dance,
Fitness and Medicine. The binary classifier of a
node N is trained using as positive examples the
instances belonging to the leaf descendants of N
and as negative examples the instances of its
siblings. For example, the binary classifier of
node Music would use all instances belonging to
Music as positive examples and all instances
belonging to Dance as negative examples.
Similarly, for the binary classifier of node
Arts, all instances belonging to Music and Dance
would be positive examples, while all instances
belonging to Fitness and Medicine would be
negative.
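
The construction of the positive and negative
sets described above can be sketched as follows.
This is a minimal illustration, not the authors'
implementation: the PARENT table encodes the
Figure 1 hierarchy as inferred from the text, and
the helper names (children, leaf_descendants,
training_leaves, split_instances) are
hypothetical.

```python
# Toy hierarchy inferred from the Figure 1 example (child -> parent).
PARENT = {
    "Arts": "Root", "Health": "Root",
    "Music": "Arts", "Dance": "Arts",
    "Fitness": "Health", "Medicine": "Health",
}


def children(node):
    """Direct children of `node` in the toy hierarchy."""
    return [child for child, parent in PARENT.items() if parent == node]


def leaf_descendants(node):
    """Leaves below `node`; a leaf is its own only leaf descendant."""
    kids = children(node)
    if not kids:
        return {node}
    leaves = set()
    for kid in kids:
        leaves |= leaf_descendants(kid)
    return leaves


def training_leaves(node):
    """Return (positive_leaves, negative_leaves) for the classifier of `node`."""
    positives = leaf_descendants(node)
    negatives = set()
    for sibling in children(PARENT[node]):
        if sibling != node:
            negatives |= leaf_descendants(sibling)
    return positives, negatives


def split_instances(node, labelled_instances):
    """Split (instance, leaf_label) pairs into positive and negative examples."""
    positives, negatives = training_leaves(node)
    pos = [x for x, label in labelled_instances if label in positives]
    neg = [x for x, label in labelled_instances if label in negatives]
    return pos, neg
```

Under these assumptions, training_leaves("Arts")
returns Music and Dance as positive leaves and
Fitness and Medicine as negative leaves, so an
instance outside the sibling subtrees never
enters a node's training set.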

These binary classifiers require fewer resources
to be trained compared to flat ones. They can
also be more accurate, since they aim to
distinguish between fewer categories. For
example, if we have 10,000 leaves, each flat
binary classifier would need to separate one
class from the 9,999 others. In the case of
cascading, it would only need to separate between
the sibling categories. Such classifiers would
also require fewer features to train on, an
important characteristic if we consider large
datasets.

Figure 1: Tree hierarchy example.

The main disadvantage of cascading is that any
mistake is carried over. For example, if an
instance belonging to the category Music gets a
higher probability from the classifier of Health
than from that of Arts, it is classified wrongly,
without taking into consideration the classifiers
of Music and Dance. In contrast, our method
computes the probability of each root-to-leaf
path for a testing instance and classifies it to
the most probable path. We call this method
Ppath. As an example, the probability of an
instance d belonging to Music is:

$$P(\mathit{Music} \mid d) = \frac{P(\mathit{Arts} \mid d)\, P(\mathit{Music} \mid \mathit{Arts}, d)}{P(\mathit{Arts} \mid \mathit{Music}, d)} \tag{1}$$

but since P(Arts | Music, d) = 1:

$$P(\mathit{Music} \mid d) = P(\mathit{Arts} \mid d)\, P(\mathit{Music} \mid \mathit{Arts}, d) \tag{2}$$

Similarly:

$$P(\mathit{Arts} \mid d) = \frac{P(\mathit{Root} \mid d)\, P(\mathit{Arts} \mid \mathit{Root}, d)}{P(\mathit{Root} \mid \mathit{Arts}, d)} \tag{3}$$

but since P(Root | Arts, d) = 1 and
P(Root | d) = 1:

$$P(\mathit{Arts} \mid d) = P(\mathit{Arts} \mid \mathit{Root}, d) \tag{4}$$

By combining (2) and (4) we get:

$$P(\mathit{Music} \mid d) = P(\mathit{Arts} \mid \mathit{Root}, d)\, P(\mathit{Music} \mid \mathit{Arts}, d) \tag{5}$$

These conditional probabilities are in fact the
ones computed by the binary classifiers of each
node. So, given a document d, a leaf C and the
set S containing C and all of its ancestors
except the root (whose probability is 1), where
Ancestor(S_i) denotes the parent of S_i:

$$P(C \mid d) = \prod_{i=1}^{|S|} P(S_i \mid \mathit{Ancestor}(S_i), d) \tag{6}$$

and we define Ppath as:

$$P_{path}(d) = \arg\max_{C} P(C \mid d) \tag{7}$$

Let us return to our initial example, where
document d belongs to Music, and assume that we
have the following probabilities:

• P(Arts | Root, d) = 0.2

• P(Health | Root, d) = 0.21

• P(Music | Arts, d) = 0.9

• P(Dance | Arts, d) = 0.6

• P(Fitness | Health, d) = 0.1

• P(Medicine | Health, d) = 0.2

If we used standard cascading, document d would
be classified to category Medicine, since Health
(0.21) beats Arts (0.2) at the first level and
Medicine (0.2) beats Fitness (0.1) at the second.
Using Ppath we get:

• P(Music | d) = 0.18

• P(Dance | d) = 0.12

• P(Fitness | d) = 0.021

• P(Medicine | d) = 0.042

and Ppath would assign d to class Music. The cost
that we have to pay, compared to standard
cascading, is that we have to compute all the
P(C | d), in order to select the one with the
highest probability.
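
The worked example above can be reproduced with a
short sketch. It is illustrative only: the
PARENT table is the Figure 1 hierarchy as
inferred from the text, COND_PROB holds the six
classifier outputs listed in the example, and the
function names (cascade, path_probability, ppath)
are hypothetical.

```python
# Toy hierarchy inferred from the Figure 1 example (child -> parent).
PARENT = {
    "Arts": "Root", "Health": "Root",
    "Music": "Arts", "Dance": "Arts",
    "Fitness": "Health", "Medicine": "Health",
}

# P(node | parent, d) for one document d, taken from the worked example.
COND_PROB = {
    "Arts": 0.2, "Health": 0.21,
    "Music": 0.9, "Dance": 0.6,
    "Fitness": 0.1, "Medicine": 0.2,
}


def children(node):
    """Direct children of `node` in the toy hierarchy."""
    return [child for child, parent in PARENT.items() if parent == node]


def cascade(root="Root"):
    """Greedy descent: follow the most probable child until a leaf is reached."""
    node = root
    while children(node):
        node = max(children(node), key=COND_PROB.get)
    return node


def path_probability(leaf):
    """Product of the conditional probabilities along the root-to-leaf path."""
    prob = 1.0
    node = leaf
    while node != "Root":
        prob *= COND_PROB[node]
        node = PARENT[node]
    return prob


def ppath():
    """Score every leaf by its path probability and return the best one."""
    leaves = [node for node in PARENT if not children(node)]
    scores = {leaf: path_probability(leaf) for leaf in leaves}
    return max(scores, key=scores.get), scores
```

With these inputs the greedy descent ends at
Medicine, while the path-probability ranking
places Music first, matching the numbers in the
example.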

4 Experimental Results

In order to compare our approach against flat and
cascade classification, we used the Task 1
dataset of the first Large Scale Hierarchical
Text Classification Challenge (LSHTC1,
http://lshtc.iit.demokritos.gr/node/1). This
dataset contains 93,505 instances (split into
train and validation files), composed of 55,765
distinct features and belonging to 12,294
categories. Classification is only allowed to the
leaves of the hierarchy, which is a tree. Each
instance belongs to only one category. There are
34,880 testing instances and the results are
evaluated using the evaluation measures of the
challenge (an Oracle is provided by the
organizers), which are the following:

• Accuracy

• Macro F-measure

• Macro Precision

• Macro Recall

• Tree Induced Error

As a classifier we used an L2 Regularized
Logistic Regression with the regularization
parameter C set to 1 (usually the default value).
We also conducted experiments with other
regularization methods and other values of C, but
the results were similar. All the experiments
were conducted using TF/IDF instead of TF
features, as our experiments indicated better
performance with this feature set.

The goal of our experiments was to illustrate
that the proposed method can improve the results
of flat and cascade classification, using the
same algorithm, L2 Regularized Logistic
Regression in this case. Further experimentation
and engineering could make the method competitive
with the best-performing systems in the
challenge. However, we consider this exercise
beyond the scope of the paper.

For flat classification, we trained one binary
classifier (one versus all) for each leaf. We
then assigned each testing document to the class
with the highest probability. For cascade
classification we trained a binary classifier for
each node of the hierarchy. We used as positive
examples all the instances belonging to the
descendant leaves of the node and as negative
examples all the instances belonging to the
descendant leaves of its siblings. This results
in more classifiers than for flat classification,
but each of these classifiers was much easier to
train, since it was trained on fewer instances.

In Table 1 we present the results of each
approach, for each evaluation measure. The main
observation is that Ppath outperforms both Flat
and Cascade. Another interesting result is that
Flat is the worst approach according to Tree
Induced Error. This is an indication that by
ignoring the hierarchy (flat classification), the
mistakes tend to be located further from the
correct category in the hierarchy. This is very
important in hierarchical classification, since
different mistakes carry different weight.
Misclassifying an instance to a sibling of the
correct category is a smaller error than
classifying it to a category 5 nodes away. Flat
evaluation measures generally fail to capture
this, so tree induced error, being the only
hierarchical measure of the five that we use, is
more suitable for comparing the three approaches.

Given that our hierarchy is a tree and each
instance belongs to only a single class, there is
no need to take into account more complex
hierarchical evaluation measures and tree induced
error is sufficient for safe conclusions.
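
The idea of weighting mistakes by their distance
in the hierarchy can be sketched as below. This
is a sketch under a stated assumption: the error
of one prediction is taken to be the number of
edges on the tree path between the true and the
predicted category, averaged over instances. The
exact definition used by the challenge Oracle may
differ. The PARENT table is the Figure 1
hierarchy as inferred from the text, and the
function names are hypothetical.

```python
# Toy hierarchy inferred from the Figure 1 example (child -> parent).
PARENT = {
    "Arts": "Root", "Health": "Root",
    "Music": "Arts", "Dance": "Arts",
    "Fitness": "Health", "Medicine": "Health",
}


def ancestors(node):
    """Path from `node` up to the root, both ends included."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def tree_distance(a, b):
    """Number of edges between `a` and `b` through their lowest common ancestor."""
    up_a = ancestors(a)
    steps_from_b = {node: i for i, node in enumerate(ancestors(b))}
    for steps_from_a, node in enumerate(up_a):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("nodes are not in the same tree")


def mean_tree_error(true_labels, predicted_labels):
    """Average tree distance between true and predicted categories."""
    distances = [tree_distance(t, p) for t, p in zip(true_labels, predicted_labels)]
    return sum(distances) / len(distances)
```

In this toy tree, confusing Music with its
sibling Dance costs 2 edges, while confusing it
with Medicine costs 4, which is the distinction a
flat measure such as accuracy cannot express.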


Evaluation Measure    Flat     Cascade   Ppath
Accuracy              0.405    0.404     0.431*
Macro F-measure       0.256    0.278     0.294*
Macro Precision       0.254    0.269     0.287*
Macro Recall          0.302*   0.289     0.302*
Tree Induced Error    3.874    3.609     3.437*

Table 1: Results for each approach per evaluation
measure, using TF/IDF features. An asterisk marks
the best performing approach for each evaluation
measure.

Both Ppath and Flat classification produce a
probability for each leaf and the highest one is
returned as the predicted category. But what if
we evaluated the list of categories, ranked
according to their probability? In order to
obtain such an assessment, in Figure 2 we
calculate the recall for the K most probable
categories, with K ranging from 1 to 10. As
expected, the probability of success increases
rapidly with K. This is very important for a
realistic semi-automated classification scenario,
where a human annotator selects the correct label
among thousands of categories. Such a system
would allow the annotator to select only between
five or ten suggestions. The second observation
is that for all values of K, Ppath performs
better than Flat.

Figure 2: Recall at the top K answers of Ppath
and Flat classification for various values of K.
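
Recall at the top K answers can be computed along
the following lines. This is a minimal sketch,
assuming single-label data, so that recall at K
is the fraction of instances whose true category
appears among the K highest-scoring leaves. The
function names and the example scores in the
usage note are hypothetical.

```python
def top_k(scores, k):
    """Return the k categories with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def recall_at_k(all_scores, true_labels, k):
    """Fraction of instances whose true category is among the top k answers.

    `all_scores` holds one {category: probability} dict per instance,
    aligned with `true_labels`.
    """
    hits = sum(
        1
        for scores, truth in zip(all_scores, true_labels)
        if truth in top_k(scores, k)
    )
    return hits / len(true_labels)
```

For instance, given two documents whose true
categories are ranked first and second
respectively, recall_at_k returns 0.5 for K = 1
and 1.0 for K = 2.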

Regarding the scalability of the approaches,
during training the two cascading approaches
(standard and Ppath) require fewer resources than
the flat classifiers. During classification,
Ppath is slower than Cascade, since it takes into
account all the root-to-leaf paths, and its cost
is similar to that of Flat classification.
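
A rough way to see the difference in
classification cost is to count how many binary
classifiers are evaluated per test instance. The
sketch below assumes an idealized complete tree
with a fixed branching factor, which is not the
shape of the LSHTC hierarchy, and it ignores the
cost of each individual classifier. The function
name and its parameters are hypothetical.

```python
def evaluations_per_instance(branching, depth):
    """Count classifier evaluations for one test instance.

    Assumes a complete tree: every inner node has `branching` children
    and every leaf lies `depth` levels below the root.
    """
    # Cascade scores only the children of the node chosen at each level.
    cascade = branching * depth
    # Ppath scores every non-root node once to obtain all path products.
    ppath = sum(branching ** level for level in range(1, depth + 1))
    # Flat scores one one-versus-all classifier per leaf.
    flat = branching ** depth
    return {"cascade": cascade, "ppath": ppath, "flat": flat}
```

For a branching factor of 10 and a depth of 3
this gives 30 evaluations for Cascade, 1110 for
Ppath and 1000 for Flat, which illustrates why
Ppath is closer to Flat than to Cascade at
classification time.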

5 Conclusions 

In this paper we present the Ppath method for
hierarchical classification. Ppath addresses the
disadvantages of traditional flat and cascade
classification. Flat classification can be very
computationally demanding in large scale problems
and also completely ignores the hierarchy
information, which can be exploited for better
results. Standard cascading, on the other hand,
is much more computationally efficient, but
suffers from the problem of early
misclassification at the top levels of the
hierarchy.

Our approach has the same training computational
complexity as Cascade, while achieving better
scores according to all the tested evaluation
measures. However, it is slower during
classification, having a complexity similar to
that of flat classification.

The version presented in this paper is designed
for tree hierarchies. As future work, we plan to
extend the idea of Ppath to DAG hierarchies.
Furthermore, in this paper we focused on
single-label classification. Although the idea of
Ppath seems compatible with multi-label
approaches, further experiments need to be
conducted in this direction.